Ford GoBike System Data Exploration

Preliminary Wrangling

This data set includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay area.

What is the structure of your dataset?

The dataset contains 183,412 ride records with 21 features (duration_sec, start_time, end_time, start_station_id, start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, end_station_longitude, bike_id, user_type, member_birth_year, member_gender, bike_share_for_all_trip, check_in_hour, date, day_of_week, distance_km, and ReturnPoint). Most variables are numeric, but user_type, member_gender, bike_share_for_all_trip, check_in_hour, day_of_week, ReturnPoint, and member_birth_year are treated as categorical variables.

Note: member_birth_year and day_of_week are used as both numeric and categorical variables for investigation purposes.

What is/are the main feature(s) of interest in your dataset?

I'm most interested in figuring out which features best predict business revenue and asset (bike) depreciation, using ride duration and distance as indicators. I am also interested in finding marketing points that are likely to improve the efficiency and effectiveness of the business.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

I expect user type to have the strongest effect on duration and distance, because subscribers are more likely than customers to ride long stretches. I also expect this to vary with the day of the week.

Univariate Exploration

I'll start by looking at the locations of bike stations for 1,000 randomly selected places.

Most of the stations are located in three densely populated zones around San Francisco Bay. I assume that the primary purpose of riding is commuting between home and the office, or neighborhood shopping, rather than just exercise.

Duration is almost normally distributed, with an average of about 12.1 minutes. This may support my initial assumption about the purpose of use.

Consistent with the short ride durations, the average distance is 1.69 kilometers, with a long-tailed distribution skewed to the right.

Most users return the bike 1 to 2 kilometers away from the starting station. About 2.1% of checked-in users did not use the bike.

There are almost eight times as many subscribers as non-subscribers, and most of them are men who do not share the bike during their entire trip.

Users' ages are concentrated between the late 20s and early 30s, and they mostly used the bike during commuting hours. There were more users on weekdays than on weekends. The highest- and lowest-demand days are Thursday and Saturday, respectively, showing a bimodal pattern.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

Duration (in seconds) took on a large range of values, so I looked at the data using a log transform. Under the transformation, the data looked bimodal, with one peak between 8 and 12 minutes.
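A log transform of this kind can be sketched by counting durations in equal-width bins on a base-10 logarithmic scale. The sample values, the bin width of 0.25, and the helper name `log_bin_counts` are illustrative assumptions, not the notebook's actual data or settings.

```python
import math
from collections import Counter


def log_bin_counts(durations_sec, bin_width=0.25):
    """Count values in equal-width bins on a log10 scale.

    Returns a dict mapping each bin's lower edge (in seconds) to its count.
    bin_width is expressed in log10 units (an assumed, adjustable setting).
    """
    counts = Counter()
    for d in durations_sec:
        if d <= 0:
            continue  # log10 is undefined for non-positive values
        bin_index = math.floor(math.log10(d) / bin_width)
        counts[bin_index] += 1
    return {10 ** (i * bin_width): n for i, n in sorted(counts.items())}


# Made-up durations in seconds, including one very long ride.
sample = [61, 180, 300, 420, 515, 600, 720, 900, 1500, 84000]
for lower_edge, n in log_bin_counts(sample).items():
    print(f"{lower_edge:>9.1f} s and up: {n}")
```

On a linear scale the single very long ride would stretch the axis and squeeze every other ride into one bin; on the log scale the short rides spread across several bins.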

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The distance is calculated by applying a formula to the latitude and longitude of the start and end stations. The result is the straight-line (shortest-path) distance in kilometers, not the route actually ridden. With GPS-based data, we would have a more precise result.
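The text does not name the formula used. One common choice for station-to-station distance is the haversine (great-circle) formula, and the sketch below assumes that choice. The function name `haversine_km`, the Earth radius constant, and the coordinates are all illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption, not from the source


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Illustrative coordinates only (two made-up points in San Francisco).
start = (37.7896, -122.4008)
end = (37.7765, -122.3942)
distance_km = haversine_km(*start, *end)
print(round(distance_km, 2))
```

Because this measures the shortest path over the Earth's surface, it gives a lower bound on the distance actually ridden, and a ride that ends at its starting station yields zero.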

Bivariate Exploration

To start off with, I want to look at the pairwise correlations present between features in the data.
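As a rough illustration of what a pairwise correlation check computes, the sketch below calculates the Pearson coefficient for every pair of numeric features. The values are made up, and the helper name `pearson` is hypothetical; in a pandas notebook, `DataFrame.corr()` would typically serve the same purpose.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for three numeric features; not taken from the dataset.
features = {
    "duration_sec": [300, 420, 515, 600, 900, 1500],
    "distance_km": [0.8, 1.2, 1.5, 1.7, 2.4, 2.1],
    "member_birth_year": [1990, 1985, 1993, 1978, 1995, 1988],
}

# One coefficient per unordered pair of features.
pairwise = {
    (a, b): pearson(features[a], features[b])
    for a, b in combinations(features, 2)
}
for (a, b), r in pairwise.items():
    print(f"{a} vs {b}: r = {r:.2f}")
```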

Younger users show a wider range of ride durations. We can also see that distance is not proportional to duration, which suggests that a long duration alone does not necessarily call for a higher depreciation factor.

Interestingly, subscribers used the bike for less time than non-subscribers, which is probably a vital marketing point because membership contracts can increase the profit margin (revenue minus cost). Although there are fewer users on weekends, they ride for longer. Given these findings, we could recommend a last-in, first-out policy that serves the newest bikes (high-tier inventory) first to weekend users with premium memberships.

Subscribers used the bike for 4 to 5 minutes less than customers, and most rode for about 10 to 11 minutes, which suggests they follow a specific daily routine. Because weekday users form more distinct diamond shapes on the plot, we can assume that subscribers usually use the bike on weekdays. In addition, users who checked in at 6 in the morning are likely loyal clients because they are already used to commuting by bike.
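A comparison like the one above can be sketched by grouping ride durations by user type and comparing the group medians. The rides below are made up purely to show the mechanics, and the labels "Subscriber" and "Customer" are assumed values for user_type.

```python
from collections import defaultdict
from statistics import median

# Made-up rides as (user_type, duration_sec) pairs; not taken from the dataset.
rides = [
    ("Subscriber", 480), ("Subscriber", 600), ("Subscriber", 660),
    ("Customer", 780), ("Customer", 900), ("Customer", 1020),
]

# Collect durations per user type.
durations_by_type = defaultdict(list)
for user_type, duration_sec in rides:
    durations_by_type[user_type].append(duration_sec)

# Median ride time per group, converted from seconds to minutes.
median_minutes = {
    user_type: median(values) / 60
    for user_type, values in durations_by_type.items()
}
gap_minutes = median_minutes["Customer"] - median_minutes["Subscriber"]
print(median_minutes, gap_minutes)
```

The median is used here rather than the mean because a few very long rides can pull the mean upward.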

Interestingly, non-subscribers rode longer distances than subscribers, as did those who do not use the bike-sharing service for the entire trip.

The gap between non-subscribers and subscribers is about 200 meters, and the gap between those who do and do not use the bike-sharing service for the entire trip is about 400 meters. In addition, users who checked in at 4 in the morning rode the shortest distances, whereas users who checked in at 5 in the morning rode above the average distance.

As we predicted earlier, most subscribers used the bike during the weekdays.

Most users rode during commuting hours regardless of gender, user type, bike-sharing status, and weekday. On weekends, however, users checked in around noon.

Duration by member birth year is roughly normally distributed around the 8.6-minute point, whereas the distance plot is skewed to the left. This may indicate that users in their mid-20s to mid-30s are remarkably active.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Younger users show a wider range of ride durations. We can also see that distance is not proportional to duration, which suggests that a long duration alone does not necessarily call for a higher depreciation factor.

Interestingly, subscribers used the bike for less time than non-subscribers, which is probably a vital marketing point because membership contracts can increase the profit margin (revenue minus cost). Although there are fewer users on weekends, they ride for longer. Given these findings, we could recommend a last-in, first-out policy that serves the newest bikes (high-tier inventory) first to weekend users with premium memberships.

Subscribers used the bike for 4 to 5 minutes less than customers, and most rode for about 10 to 11 minutes, which suggests they follow a specific daily routine. Because weekday users form more distinct diamond shapes on the plot, we can assume that subscribers usually use the bike on weekdays. In addition, users who checked in at 6 in the morning are likely loyal clients because they are already used to commuting by bike.

Interestingly, non-subscribers rode longer distances than subscribers, as did those who do not use the bike-sharing service for the entire trip. The gap between non-subscribers and subscribers is about 200 meters, and the gap between those who do and do not use the bike-sharing service for the entire trip is about 400 meters. In addition, users who checked in at 4 in the morning rode the shortest distances, whereas users who checked in at 5 in the morning rode above the average distance. As we predicted earlier, most subscribers used the bike during the weekdays.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Most users rode during commuting hours regardless of gender, user type, bike-sharing status, and weekday. On weekends, however, users checked in around noon. Duration by member birth year is roughly normally distributed around the 8.6-minute point, whereas the distance plot is skewed to the left. This may indicate that users in their mid-20s to mid-30s are remarkably active.

Multivariate Exploration

The main thing I want to explore in this part of the analysis is how the categorical variables, such as user type, gender, and check-in hour, play into the relationship between duration and distance.

Most users are younger than their mid-30s; however, older users make up a growing share after 4 a.m., primarily among non-subscribers.

Ride duration and check-out distance are hardest to predict for users who checked in between 2 a.m. and 4 a.m.; these users are primarily non-subscribers of mixed genders.

Users who checked in between 7 a.m. and 8 a.m. rode the longest distances, and most of them are non-subscribers.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Ride duration and check-out distance are hardest to predict for users who checked in between 2 a.m. and 4 a.m.; these users are primarily non-subscribers of mixed genders. Users who checked in between 7 a.m. and 8 a.m. rode the longest distances, and most of them are non-subscribers.

Were there any interesting or surprising interactions between features?

Most users are younger than their mid-30s; however, older users make up a growing share after 4 a.m., primarily among non-subscribers.